11 research outputs found

    A Possibilistic Query Translation Approach for Cross-Language Information Retrieval

    Get PDF
    In this paper, we explore several statistical methods to address the problem of query translation ambiguity. We propose and compare a new possibilistic approach for query translation, derived from a probabilistic one by applying a classical probability-possibility transformation of probability distributions, which introduces a certain tolerance in the selection of word translations. Finally, the best words are selected based on a similarity measure. The experiments are performed on the CLEF-2003 French-English CLIR collection, which allowed us to test the effectiveness of the possibilistic approach.
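    The classical probability-possibility transformation mentioned above can be sketched as follows. This is a minimal illustration of the standard transformation (the most probable outcome becomes fully possible, less probable ones keep graded degrees); the function name and the toy distribution are illustrative, not taken from the paper:

```python
def prob_to_poss(probs):
    """Classical probability-to-possibility transformation:
    pi(x) = sum of p(y) over all y with p(y) <= p(x).
    The most probable candidate gets possibility 1.0, while less
    probable translations keep graded, tolerant degrees."""
    return [sum(q for q in probs if q <= p) for p in probs]

# Toy probability distribution over three candidate translations
# of one query word (values invented for illustration).
probs = [0.5, 0.3, 0.2]
print(prob_to_poss(probs))  # [1.0, 0.5, 0.2]
```

    Note how the transformation is more tolerant than the probabilities: the second candidate rises from 0.3 to 0.5, so it is less likely to be discarded outright during translation selection.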

    Organizing Contextual Knowledge for Arabic Text Disambiguation and Terminology Extraction.

    Get PDF
    Ontologies have an important role in knowledge organization and information retrieval. Domain ontologies are composed of concepts represented by domain-relevant terms. Existing approaches to ontology construction make use of statistical and linguistic information to extract domain-relevant terms. The quality and quantity of this information influence the accuracy of terminology extraction approaches and of other steps in knowledge extraction and information retrieval. This paper proposes an approach for extracting domain-relevant terms from Arabic non-diacriticised semi-structured corpora. As input, the structure of documents is exploited to organize knowledge in a contextual graph, which is then used to extract relevant terms. This network contains simple and compound nouns produced by a morphosyntactic shallow parser. The noun phrases are evaluated in terms of termhood and unithood by means of possibilistic measures. We apply a qualitative approach, which weighs terms according to their positions in the structure of the document. As output, the extracted knowledge is organized as a network modeling dependencies between terms, which can be exploited to infer semantic relations. We test our approach on three specific domain corpora. The goal of this evaluation is to check whether our model for organizing and exploiting contextual knowledge improves the accuracy of extraction of simple and compound nouns. We also investigate the role of compound nouns in improving information retrieval results.
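    The qualitative, position-based weighting described above can be illustrated with a minimal sketch; the weight values and field names are assumptions for illustration, not the paper's actual parameters:

```python
# Hypothetical structural weights: terms occurring in prominent
# positions (title, section headings) count more than body-text terms.
FIELD_WEIGHT = {"title": 1.0, "heading": 0.8, "body": 0.5}

def term_weights(occurrences):
    """occurrences: list of (term, field) pairs produced by a shallow
    parser. Returns each term's best (max) structural weight, in the
    spirit of a qualitative, possibility-like scoring."""
    weights = {}
    for term, field in occurrences:
        w = FIELD_WEIGHT.get(field, 0.0)
        weights[term] = max(weights.get(term, 0.0), w)
    return weights

occ = [("information retrieval", "title"),
       ("ontology", "heading"),
       ("ontology", "body"),
       ("corpus", "body")]
print(term_weights(occ))
# {'information retrieval': 1.0, 'ontology': 0.8, 'corpus': 0.5}
```

    Taking the maximum over occurrences (rather than a sum) keeps the scale qualitative: a term seen once in a title outranks a term seen many times in the body.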

    A hybrid approach for Arabic semantic relation extraction

    Get PDF
    Information retrieval applications are essential tools to manage the huge amount of information on the Web. Ontologies have great importance in these applications: data belonging to a domain of interest are represented and related semantically in an ontology, which helps to navigate, manage and reuse these data. Despite the growing need for ontologies, only a few works have addressed the Arabic language. Indeed, Arabic texts are highly ambiguous, especially when diacritics are absent. Besides, existing works do not cover all the types of semantic relations that are useful to structure Arabic ontologies. A lot of work has been done on co-occurrence-based techniques, which lead to over-generation. In this paper, we propose a new approach for Arabic semantic relation extraction. We use vocalized texts to reduce ambiguities and propose a new distributional approach for similarity calculation, which is compared to co-occurrence. We discuss our contribution through experimental results and propose some perspectives for future research.
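    The contrast drawn above between raw co-occurrence and distributional similarity can be sketched roughly as follows; the tiny corpus, window size and helper names are illustrative, not the paper's actual method:

```python
from collections import Counter
from math import sqrt

def context_vectors(tokens, window=2):
    """Build a context-count vector (distributional profile) per word."""
    vectors = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(a, b):
    """Distributional similarity: cosine between two context vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokens = "the cat eats fish the dog eats meat".split()
vecs = context_vectors(tokens)
# 'cat' and 'dog' never co-occur within the 2-word window here, yet
# their shared contexts ('the', 'eats', 'fish') give a high similarity,
# which a pure co-occurrence count would miss.
print(cosine(vecs["cat"], vecs["dog"]))
```

    This is why distributional measures can relate words that never appear together, at the cost of sometimes relating words that merely share generic contexts.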

    Construction and integration of ontologies for the socio-semantic mapping of Arabic document collections guided by information reliability

    No full text
    This thesis proposes a knowledge-mapping process for Arabic document collections. The main objective of this process is to allow different users to find the relevant information they are looking for. Since relevance is a multidimensional notion, we designed a generic model to represent multi-criteria knowledge maps. A map is composed of a set of ontologies (each representing one dimension) that are linked to document fragments. The maps are equipped with mechanisms for evaluating information according to users' needs. At this stage, we gave primary importance to information reliability as a critical requirement in the current state of the Web. We adopted the socio-semantic Web point of view, which considers documents as semiotic productions. Another fundamental choice made in this thesis is to use the Hadith corpus, a large document collection that is structured and rich in both knowledge and divergences. Moreover, the hadith constitutes a solid methodology for ensuring information reliability. Owing to these characteristics, the books of hadith are semiotic productions well suited to socio-semantic processing. Multidimensional representation requires extracting and organizing knowledge along several axes. Along the semantic axis, we propose to extract the terms relevant to each theme, considered as a knowledge domain. Along the social axis, we propose a social search engine that extracts named entities and recognizes the identities of actors. The extracted knowledge is organized using a distributional analysis method based on hierarchical small-world networks, which makes it possible to build differential ontologies. Finally, we integrate possibilistic networks as a tool for information evaluation. The user thus has access to the system's judgment on topical relevance and reliability, as well as to the tools needed to conduct an inquiry process from an open information-seeking perspective.

    Adapting Pre-trained Language Models to Rumor Detection on Twitter

    No full text
    Fake news has invaded social media platforms, where false information is propagated with malicious intent at a fast pace. These circumstances require the development of solutions to monitor and detect rumors in a timely manner. In this paper, we propose an approach that seeks to detect emerging and unseen rumors on Twitter by adapting a pre-trained language model, namely RoBERTa, to the task of rumor detection. A comparison against content-based characteristics has shown the capability of the model to surpass handcrafted features. Experimental results show that our approach outperforms state-of-the-art ones on all metrics and that fine-tuning RoBERTa leads to richer word embeddings that consistently and significantly enhance the precision of rumor recognition.

    Evaluation of a possibilistic approach for the disambiguation of Arabic texts (TALN'2014 – Traitement Automatique des Langues Naturelles, Marseille, France, 01/07/14-04/07/14)

    No full text
    Long paper at TALN 2014. Morphological disambiguation of Arabic words consists in identifying their appropriate morphological analysis. In this paper, we present three models for the morphological disambiguation of non-vocalized Arabic texts based on possibilistic classification. This approach deals with imprecise training and testing datasets, as we learn from untagged texts. We experiment with our approach on two corpora, namely the Hadith corpus and the Arabic Treebank. These corpora contain data of different types, classical and modern. We compare our models to probabilistic and statistical classifiers. To do this, we transform the structure of the training and test sets to deal with imprecise data.
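    The possibilistic flavour of classification described above can be sketched minimally as follows, using a min-combination of per-feature possibility degrees; the labels, features and degrees are invented for illustration and are not the paper's actual model:

```python
def possibilistic_classify(features, model):
    """model[label][feature] is a possibility degree in [0, 1] that the
    feature co-occurs with that label. Unknown features default to 1.0
    (fully possible), which encodes imprecision rather than absence.
    Feature degrees are combined with min; the maximal label wins."""
    scores = {label: min(degrees.get(f, 1.0) for f in features)
              for label, degrees in model.items()}
    return max(scores, key=scores.get), scores

# Toy model: two candidate morphological analyses of a non-vocalized
# Arabic form, scored against contextual features (values made up).
model = {
    "noun": {"prev_is_determiner": 0.9, "next_is_verb": 0.4},
    "verb": {"prev_is_determiner": 0.2, "next_is_verb": 0.8},
}
label, scores = possibilistic_classify(["prev_is_determiner"], model)
print(label, scores)  # noun {'noun': 0.9, 'verb': 0.2}
```

    The default degree of 1.0 for unseen features is what lets such a classifier learn from untagged, imprecise data without zeroing out every candidate.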

    Improving Query Expansion by Automatic Query Disambiguation in Intelligent Information Retrieval

    No full text
    We study in this paper the impact of Word Sense Disambiguation (WSD) on Query Expansion (QE) for monolingual intelligent information retrieval. The proposed approaches for WSD and QE are based on corpus analysis using co-occurrence graphs modelled by possibilistic networks. Indeed, our model for relevance judgment uses possibility theory to take advantage of a double measure (possibility and necessity). Our experiments are performed using the standard ROMANSEVAL test collection for the WSD task and the CLEF-2003 benchmark for the QE process in French monolingual Information Retrieval (IR) evaluation. The results show the positive impact of WSD on QE based on the standard recall/precision metrics.
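    The "double measure" mentioned above is the standard pair from possibility theory: the possibility Π(A) of an event and its necessity N(A) = 1 - Π(not A). A minimal sketch under an invented possibility distribution over candidate documents:

```python
def possibility(pi, event):
    """Pi(A): possibility of event A = max degree over its elements."""
    return max((pi[x] for x in event), default=0.0)

def necessity(pi, event):
    """N(A) = 1 - Pi(complement of A): how *certain* A is,
    not merely how plausible."""
    complement = [x for x in pi if x not in event]
    return 1.0 - possibility(pi, complement)

# Invented possibility distribution over three candidate documents.
pi = {"d1": 1.0, "d2": 0.6, "d3": 0.3}
relevant = {"d1", "d2"}
print(possibility(pi, relevant))  # 1.0
print(necessity(pi, relevant))    # 0.7
```

    The two numbers bracket the relevance judgment: possibility rules documents in, while necessity expresses how strongly the evidence rules the alternatives out.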

    A Comparative Study between Possibilistic and Probabilistic Approaches for Monolingual Word Sense Disambiguation

    No full text
    This paper proposes and assesses a new possibilistic approach for automatic monolingual word sense disambiguation (WSD). In spite of their advantages, traditional dictionaries suffer from a lack of accurate information useful for WSD. Moreover, there is a lack of high-coverage semantically labeled corpora on which learning methods could be trained. For these reasons, it became important to use a semantic dictionary of contexts (SDC) enabling machine learning on a semantic platform for WSD. Our approach combines traditional dictionaries and labeled corpora to build an SDC and identify the sense of a word using a possibilistic matching model. Besides, we present and evaluate a second, new probabilistic approach for automatic monolingual WSD. This approach uses and extends an existing probabilistic semantic distance to compute similarities between words by exploiting the semantic graph of a traditional dictionary and the SDC. To assess and compare these two approaches, we performed experiments on the standard ROMANSEVAL test collection and compared our results to some existing French monolingual WSD systems. The experiments showed an encouraging improvement in the disambiguation rates of French words. These results reveal the contribution of possibility theory as a means to treat imprecision in information systems.
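    A crude sketch of what possibilistic matching between a word's context and candidate senses might look like; the max-combination, the hypothetical SDC structure and the toy data are assumptions for illustration, not the paper's exact model:

```python
def match_degree(context, sense_profile):
    """Possibility that a sense fits the context: the best (max) degree
    among the context words, each read from the sense's profile
    (0.0 for words the profile has never seen)."""
    return max((sense_profile.get(w, 0.0) for w in context), default=0.0)

def disambiguate(context, sdc):
    """sdc: hypothetical 'semantic dictionary of contexts' mapping each
    sense to {context word: possibility degree}. Pick the best match."""
    return max(sdc, key=lambda s: match_degree(context, sdc[s]))

# Toy SDC for French 'avocat' (lawyer vs. avocado); degrees invented.
sdc = {
    "lawyer":  {"tribunal": 1.0, "juge": 0.9, "salade": 0.0},
    "avocado": {"salade": 1.0, "fruit": 0.8, "juge": 0.0},
}
print(disambiguate(["juge", "tribunal"], sdc))  # lawyer
```

    In a real SDC the profiles would be built from dictionaries and labeled corpora rather than hand-written, but the matching step would have this shape.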

    Towards a New Standard Arabic Test Collection for Mono- and Cross-Language Information Retrieval (poster)

    No full text
    We propose in this paper a new standard Arabic test collection for mono- and cross-language Information Retrieval (CLIR). To do this, we exploit the "Hadith" texts and provide a portal for sampling and evaluating Hadiths' results, listed in both Arabic and English versions. The new standard Arabic test collection, called "Kunuz", will promote and restart the development of Arabic monolingual retrieval and CLIR systems, blocked since the earlier TREC-2001 and TREC-2002 editions.